Capstone Project - Predicting House Prices in KING COUNTY

The purpose of this project is to build a regression model and apply feature engineering methods to achieve good accuracy, with low prediction error, when estimating house prices in KING COUNTY. Geo location analysis using latitudes and longitudes is used to identify high price locations, addresses of high cost houses and quality house areas, and to create a new gold feature by visualizing the house price outliers.

The following are the attributes of the dataset:

cid : Notation for a house
dayhours : Date house was sold
price : Price is prediction target (Target Variable)
room_bed : Number of Bedrooms/House
room_bath : Number of bathrooms/bedrooms
living_measure : Square footage of the home
lot_measure : Square footage of the lot
ceil : Total floors (levels) in house
coast : House which has a view to a waterfront
sight : Has been viewed
condition : How good the condition is (Overall)
quality : Grade given to the housing unit, based on grading system
ceil_measure : Square footage of house apart from basement
basement_measure : Square footage of the basement
yr_built : Built Year
yr_renovated : Year when house was renovated
zipcode : Zip code
lat : Latitude coordinate
long : Longitude coordinate
living_measure15 : Living room area in 2015 (implies some renovations)
lot_measure15 : Lot size area in 2015 (implies some renovations)
furnished : Based on the quality of room
total_area : Measure of both living and lot

1. Import Libraries and load dataset

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
sns.set(color_codes=True)
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
In [2]:
# Load the PROCESSED HOUSE Dataset
house_feature_df  = pd.read_csv("house_feature_df.csv")

Geo location analysis using latitudes and longitudes to identify high price locations

In [3]:
#Import GeoPandas package to plot the latitudes and longitudes on a map
import geopandas as gpd
#Import Point and Polygon modules from the package Shapely
from shapely.geometry import Point, Polygon
In [4]:
#Read the shape file of King County, Washington, United States of America
KingCounty_Washington_map = gpd.read_file("USA_adm1_select2.shp")
#Plot the map of King County of Washington               
fig, ax = plt.subplots(figsize = (100,100))
KingCounty_Washington_map.plot(color='grey', ax=ax, alpha = 0.4)
Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0xc674da0>
In [5]:
#Creates the point from the latitude and longitude of the located house.
#Point is essentially a single object that describes the longitude and latitude of a data-point
geometry = [Point(xy) for xy in zip(house_feature_df.long, house_feature_df.lat)]
In [6]:
#Convert the dataframe house_feature_df into the GeoPandas dataframe house_feature_df_Geo
#crs = {'init': 'epsg:4326'}
#house_feature_df_Geo = gpd.GeoDataFrame(house_feature_df, crs=crs, geometry=geometry)
house_feature_df_Geo = gpd.GeoDataFrame(house_feature_df,  geometry=geometry) 
In [7]:
#Plot the houses whose price identified as outliers (> 1128000)  

fig, ax = plt.subplots(figsize = (100,100))
KingCounty_Washington_map.plot(color='grey', ax=ax, alpha = 0.9)
house_feature_df_Geo.geometry.plot(marker='.', color = 'Red', ax = ax,alpha=.5, markersize = 500)
house_feature_df_Geo[house_feature_df_Geo['price']>1128000].geometry.plot(marker="s",color = 'green', ax = ax, label = 'Price > 1128000',facecolors="None", alpha=.7, markersize = 1500)
plt.title('Aerial View of High Cost Houses Spread',fontsize=150) 
plt.legend()
Out[7]:
<matplotlib.legend.Legend at 0xc594ac8>
High cost houses (shown in green) are concentrated in the middle of the King County area.
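
The 1128000 cutoff used in the plot above is described as the price outlier boundary, but its derivation is not shown in this notebook. One common way to obtain such a boundary is Tukey's rule, Q3 + 1.5 x IQR. The sketch below assumes that rule and uses made-up prices; `upper_outlier_bound` is a hypothetical helper name, not something from the original project.

```python
import pandas as pd


def upper_outlier_bound(values: pd.Series, k: float = 1.5) -> float:
    """Upper fence of Tukey's rule: Q3 + k * IQR (assumed rule, see lead-in)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    return q3 + k * (q3 - q1)


# Made-up prices for illustration only, not taken from the house dataset
prices = pd.Series([300000, 400000, 450000, 500000, 650000, 2000000])

bound = upper_outlier_bound(prices)
outliers = prices[prices > bound]
print(bound)
print(outliers.tolist())
```

Applied to the `price` column of `house_feature_df`, the same function would give a data-driven threshold that can be compared against the hard-coded 1128000.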

Get sample addresses of a few high cost houses

In [8]:
from geopy.geocoders import Nominatim
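#user_agent must identify the application; the name below is a placeholder, Nominatim rejects an empty or default value
geolocator = Nominatim(user_agent="king_county_house_prices")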
In [9]:
location = geolocator.reverse("47.5306, -122.134")
print(location.address)
15415, Southeast 80th Street, Newcastle, King County, Washington, 98059, USA
In [10]:
location = geolocator.reverse("47.6425, -122.406")
print(location.address)
2558, 38th Avenue West, Magnolia, Seattle, King County, Washington, 98199, USA
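
The two lookups above use hand-typed coordinates. A minimal sketch of how the same lookup could be driven from the dataframe is shown below. `sample_addresses` is a hypothetical helper; it accepts any callable that behaves like `geolocator.reverse` (a "lat, long" string in, an object with an `.address` attribute out), and it pauses between calls because the public Nominatim service asks for at most one request per second.

```python
import time

import pandas as pd


def sample_addresses(df, reverse, n=3, pause=1.0, threshold=1128000):
    """Return (lat, long, address) tuples for up to n houses priced above threshold.

    `reverse` is any callable that maps a "lat, long" string to an object with
    an `.address` attribute (or None), for example geopy's `geolocator.reverse`.
    """
    premium = df[df["price"] > threshold].head(n)
    results = []
    for lat, lon in zip(premium["lat"], premium["long"]):
        location = reverse(f"{lat}, {lon}")
        address = location.address if location is not None else None
        results.append((lat, lon, address))
        time.sleep(pause)  # stay within the one-request-per-second guideline
    return results
```

With the objects defined in this notebook, a call could look like `sample_addresses(house_feature_df, geolocator.reverse, n=3)`.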

Create a new column that flags a house as Premium if its price is above 1128000, and store the output file

In [11]:
#Flag houses priced above the outlier threshold as premium (1), all others as 0
#A vectorized comparison avoids the row-by-row chained assignment, which pandas warns about
house_feature_df['Premium_House'] = (house_feature_df['price'] > 1128000).astype(int)
In [12]:
house_feature_df.head(1)
Out[12]:
Unnamed: 0 yr_built yr_renovated house_age age_after_renovtion room_bed room_bath living_measure lot_measure ceil ... zipcode lat long living_measure15 lot_measure15 furnished total_area price geometry Premium_House
0 0 1956 0 58 0 4 3.25 3020 13457 1.0 ... 98133 47.7174 -122.336 2120 7553 1 16477 808100 POINT (-122.336 47.7174) 0

1 rows × 26 columns
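
Once a premium flag exists, one way to summarize where high price houses concentrate is to aggregate it by zipcode. The sketch below is only an illustration on made-up rows: the column names follow the dataset, but the values and the `premium_by_zip` name are assumptions.

```python
import pandas as pd

# Made-up rows for illustration; column names follow the dataset
houses = pd.DataFrame({
    "zipcode": [98004, 98004, 98133, 98133, 98133],
    "price": [1500000, 1200000, 808100, 450000, 1300000],
})
houses["Premium_House"] = (houses["price"] > 1128000).astype(int)

# Count and share of premium houses per zipcode, highest share first
premium_by_zip = (
    houses.groupby("zipcode")["Premium_House"]
    .agg(premium_count="sum", premium_share="mean")
    .sort_values("premium_share", ascending=False)
)
print(premium_by_zip)
```

A table like this complements the map: the map shows where premium houses sit, while the per-zipcode share gives a number that can be fed back into the model as a feature.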

In [13]:
#Plot Top 3 Rated houses (Condition = 3, 4, 5)   

fig, ax = plt.subplots(figsize = (100,100))
KingCounty_Washington_map.plot(color='grey', ax=ax, alpha = 0.9)
house_feature_df_Geo.geometry.plot(marker='.', color = 'None', ax = ax,alpha=.5, markersize = 500)
house_feature_df_Geo[house_feature_df_Geo['condition'] == 3].geometry.plot(marker="s",color = 'blue', ax = ax, label = 'Condition 3',facecolors="None", alpha=.7, markersize = 100)
house_feature_df_Geo[house_feature_df_Geo['condition'] == 4].geometry.plot(marker="s",color = 'green', ax = ax, label = 'Condition 4',facecolors="None", alpha=.7, markersize = 50)
house_feature_df_Geo[house_feature_df_Geo['condition'] == 5].geometry.plot(marker="s",color = 'black', ax = ax, label = 'Condition 5',facecolors="None", alpha=.7, markersize = 10)
plt.title('Aerial View of Top 3 Rated Housing Categories',fontsize=150)
plt.legend()
Out[13]:
<matplotlib.legend.Legend at 0x13090c88>
In [14]:
#Plot Top 2 Quality houses spread (Quality = 7, 8)    

fig, ax = plt.subplots(figsize = (100,100))
KingCounty_Washington_map.plot(color='grey', ax=ax, alpha = 0.9)
house_feature_df_Geo.geometry.plot(marker='.', color = 'Red', ax = ax,alpha=.5, markersize = 500)
house_feature_df_Geo[house_feature_df_Geo['quality'] == 7].geometry.plot(marker="s",color = 'blue', ax = ax, label = 'Quality 7',facecolors="blue", alpha=1, markersize = 100)
house_feature_df_Geo[house_feature_df_Geo['quality'] == 8].geometry.plot(marker="s",color = 'green', ax = ax, label = 'Quality 8',facecolors="green", alpha=1, markersize = 10)
plt.title('Aerial View of Top 2 Quality Housing areas',fontsize=150)
plt.legend(fontsize=60)
Out[14]:
<matplotlib.legend.Legend at 0x2ae23f28>
Almost half of the King County houses are of quality 7 or 8, with quality 7 (shown in blue) dominating the area.
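
A statement about the share of houses per quality grade can be checked numerically as well as visually. The sketch below shows one way to do that with `value_counts(normalize=True)`; the grades are made up for illustration, so the printed share says nothing about the real dataset.

```python
import pandas as pd

# Made-up quality grades for illustration only
quality = pd.Series([7, 7, 7, 8, 8, 6, 9, 10], name="quality")

# Fraction of houses per grade, ordered by grade
share = quality.value_counts(normalize=True).sort_index()

# Combined share of grades 7 and 8
share_7_8 = share.loc[[7, 8]].sum()
print(share)
print(share_7_8)
```

Running the same two lines on `house_feature_df['quality']` would give the actual proportion behind the observation above.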
In [15]:
#Plot the houses having coast facing

fig, ax = plt.subplots(figsize = (100,100))  
KingCounty_Washington_map.plot(color='grey', ax=ax, alpha = 0.9)
house_feature_df_Geo.geometry.plot(marker='.', color = 'red', ax = ax,alpha=.9, markersize = 1)
house_feature_df_Geo[house_feature_df_Geo['coast']==1].geometry.plot(marker="s",color = 'blue', ax = ax, label = 'House with waterfront view', alpha=1, markersize = 100 )
plt.title('Aerial View of the King County Houses facing the Water Front',fontsize=150)
plt.legend(fontsize=60) 
Out[15]:
<matplotlib.legend.Legend at 0x3a5228d0>
Very few houses in the King County area face the waterfront.
In [16]:
import pylab 
fig, ax = plt.subplots(figsize = (100,100))
KingCounty_Washington_map.plot(color='grey', ax=ax, alpha = 0.4)
house_feature_df_Geo.geometry.plot(marker='^', color = 'red', ax = ax,label = 'Geometry', alpha=.5, markersize = 75 )

#Plot the houses which do not have a waterfront view, i.e. coast == 0
house_feature_df_Geo[house_feature_df_Geo['coast']==0].geometry.plot(marker="s",color = 'green', ax = ax, label = 'House without waterfront view',facecolors="None", alpha=.5, markersize = 100 )

#Houses which have a waterfront view (coast == 1) are plotted in the previous cell
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x2af25860>
In [17]:
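#Drop the geometry column before saving; errors='ignore' covers GeoPandas versions that do not add it to the original dataframe
house_feature_df  = house_feature_df.drop(['geometry'], axis=1, errors='ignore')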
In [18]:
house_feature_df['Premium_House']   = house_feature_df['Premium_House'].astype(np.int64)
In [19]:
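# Write the output file to the working directory (this overwrites the input file loaded at the start)
# index=False keeps the row index out of the file, so no extra 'Unnamed: 0' column appears on reload
house_feature_df.to_csv('house_feature_df.csv', index=False)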
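
The `head()` output earlier in this notebook shows an `Unnamed: 0` column, which is what pandas produces when a CSV that was written together with its row index is read back. The sketch below reproduces the effect in memory with `io.StringIO`, so no file is touched; the two-row dataframe is made up for illustration.

```python
import io

import pandas as pd

df = pd.DataFrame({"price": [808100, 1500000], "Premium_House": [0, 1]})

# Default behaviour: the row index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False leaves the row index out of the file
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```

Because this notebook reads and writes the same file name, every save with the index included would add one more unnamed column on the next load.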

End of Geographical Analysis